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Summary. We propose a Bayesian nonparametric approach to the problem of jointly modeling 

multiple related time series. Our approach is based on the discovery of a set of latent, shared 
dynamical behaviors. Using a beta process prior, the size of the set and the sharing pattern 
are both inferred from data. We develop efficient IVIarkov chain IVIonte Carlo methods based 
on the Indian buffet process representation of the predictive distribution of the beta process, 
without relying on a truncated model. In particular, our approach uses the sum-product algo- 
rithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new 
dynamical behaviors via birth and death proposals. We examine the benefits of our proposed 
feature-based model on several synthetic datasets, and also demonstrate promising results on 
unsupervised segmentation of visual motion capture data. 

Keywords: beta process; hidden Markov model; Indian buffet process; Markov 
switching process; multiple time series; nonparametric Bayes. 

1. Introduction 

Classical time series analysis has generally focused on a single (potentially multivariate) 

time scries from which inferences arc to be made. For example, one might monitor the 
daily returns of a particular stock index and wish to infer the changing regimes of volatility. 
However, in a growing number of fields, interest is in making inferences based on a collection 
of related time series. One might monitor multiple financial indices, or collect EEG data 
from a given patient at multiple non-contiguous epochs. We focus on time series with 
dynamics that are too complex to be described using standard linear dynamical models 
(e.g., autoregressive processes), but that exhibit switches among a set of behaviors that 
describe locally coherent and simple dynamic modes that persist over a segment of time. 
For example, stock returns might be modeled via switches between regimes of volatility or 
an EEG recording between spiking patterns dependent on seizure type. In such cases, one 
would like to discover and model the dynamical behaviors which are shared among several 
related time series. In essence, we would like to capture a combinatorial form of shrinkage 
involving subsets of behaviors from an overall library of behaviors. 

As a specific motivating example that we consider later in this paper, consider a mul- 
tivariate time series that arises when position and velocity sensors are placed on the limbs 
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and joints of a person who is going through an exercise routine. In the specific dataset that 
we analyze, the time series can be segmented into types of exercise (e.g., jumping jacks, 
touch-the-toes and twists). The goal is to discover these exercise types (i.e., the "behav- 
iors") and their occurrences in the data stream. Moreover, the overall dataset consists of 
multiple time series obtained from multiple individuals, each of whom performs some subset 
of exercise types. We would like to take advantage of the overlap between individuals, such 
that if a "jumping-jack behavior" is discovered in the time series for one individual then it 
can be used in modeling the data for other individuals. 

A flexible yet simple method of describing single time series with such patterned be- 
haviors is the class of Markov switching processes. These processes assume that the time 
series can be described via Markov transitions between a set of latent dynamic behaviors 
which are individually modeled via temporally independent or linear dynamical systems. 
Examples include the hidden Markov model (HMM), switching vector autoregressive (VAR) 
process, and switching linear dynamical system (SLDS). These models have proven useful 
in such diverse fields as speech recognition, econometrics, neuroscience, remote target track- 
ing, and human motion capture. In this paper, we focus our attention on the descriptive 
yet computationally tractable class of switching VAR processes. In this case, the state, or 
dynamical mode, of the underlying Markov process encodes the dynamic behavior exhibited 
at a given time step and each dynamic behavior is a VAR process. That is, conditioned on 
the Markov-evolving state, the likelihood is simply a VAR model. 

To discover the dynamic behaviors shared between multiple time series, we propose 
a feature-based model. Globally, the collection of time series can be described by the 
shared library of possible dynamic behaviors. Individually, however, a given time series 
will only exhibit some subset of these behaviors. That is, each time series has a vocabulary 
of possible states. The goal in relating the time series is to discover which behaviors are 
shared amongst the time series and which are unique. Let us represent the vocabulary of 
time series ? by a feature vector with fik = 1 indicating that time series i has behavior 
k in its vocabulary. We seek a prior for these feature vectors. We particularly aim to allow 
flexibility in the number of total and time-series-specific behaviors, and to encourage time 
series to share similar subsets of the large set of possible behaviors. Our desiderata motivate 
a feature-based Bayesian nonparametric approach based on the beta process (Hjort, 1990; 
Thibaux and Jordan, 2007) . Such an approach allows for infinitely many potential dynamic 
behaviors, but encourages a sparse representation. 

In our scenario, one can think of the beta process as defining a coin-flipping probability 
for each of an infinite set of possible dynamic behaviors. Each time series' feature vector 
is modeled as the result of a Bernoulli process draw: the beta-process-determined coins are 
flipped for each dynamic behavior and the set of resulting heads indicate the set of selected 
features (implicitly defining an infinite-dimensional feature vector.) The properties of the 
beta process induce sparsity in the feature space by encouraging sharing of features among 
the Bernoulli process observations. Specifically, the total sum of coin weights is finite and 
only certain dynamic behaviors have large coin weights. Thus, certain dynamic behaviors 
are more prevalent in the vocabularies of the time series, though the resulting vocabularies 
clearly need not be identical. As shown by Thibaux and Jordan (2007), integrating over the 
latent beta process random measure (i.e., coin- flipping weights) induces a predictive distri- 
bution on features known as the Indian buffet process (IBP) (Griffiths and Ghahramani, 
2005). Computationally, this representation is key. Given a sampled feature set, our model 
reduces to a collection of finite Bayesian VAR processes with partially shared parameters. 

Our presentation is organized as follows. The beta process is reviewed in Section 2.3, 
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following a brief overview of Markov switching processes. In Section 3, we present our 
proposed beta-process-based model for jointly modeling multiple related Markov switching 
processes. Efficient posterior computations based on a Markov chain Monte Carlo (MCMC) 
algorithm are developed in Section 4. The algorithm does not rely on model truncation; 
instead, we exploit the finite dynamical system induced by a fixed set of features to efficiently 
compute acceptance probabilities, and reversible jump birth and death proposals to explore 
new features. The sampling of features relies on the IBP interpretation of the beta process — 
the connection between the beta process and the IBP is outlined in Section 4.1. In Section 5, 
we describe related approaches. Section 6 examines the benefits of our proposed feature- 
based model on several synthetic datasets. Finally, in Section 7 we present promising results 
on the challenging task of unsupervised segmentation of data from the CMU motion capture 
database (CMU, 2009). 

2. Background 

2. 1. Markov Switcliing Processes 

Hidden Markov Models 

The hidden Markov model, or HMM, is a class of doubly stochastic processes based on an 
underlying, discrete- valued state sequence that is modeled as Markovian (Rabiner, 1989). 
Conditioned on this state sequence, the model assumes that the observations, which may 
be discrete or continuous valued, are independent. Specifically, let zt denote the state, or 
dynamical mode, of the Markov chain at time t and let tt^ denote the state-specific transition 
distribution for mode j. Then, the Markovian structure on the mode sequence dictates that 

Zt I zt_i - 7r^,_i. (1) 

Given the mode Zj, the observation yt is a conditionally independent emission 

yt\zt^F{e,J (2) 

for an indexed family of distributions F[-). Here, 9i are the emission parameters for mode i. 



Switching VAR Processes 

The modeling assumption of the HMM that observations are conditionally independent 
given the latent mode sequence is often insufficient in capturing the temporal dependencies 
present in many datasets. Instead, one can assume that the observations have conditionally 
linear dynamics. The latent HMM dynamical mode then models switches between a set 
of such linear dynamical systems in order to capture more complex dynamical phenomena. 
We restrict our attention in this paper to switching vector autoregressive (VAR) processes, 
or autoregressive HMMs ( AR-HMMs) , which are broadly applicable in many domains while 
maintaining a number of simplifying properties that make them a practical choice compu- 
tationally. 

We define an AR-HMM, with switches between order-r vector autoregressive processes f, 

as 

r 

Vt^Yl ^^^^t^t-^ + ^t{zt), (3) 

i=l 



fWe denote an order-r VAR process by VAR(r). 
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where zt represents the HMM latent dynamical mode of the system at time i, and is defined 
as in Eq. (1). The mode-specific additive noise term is distributed as ef(zf) ^ A/'(0, S^j). 
We refer to Ak ~ {^i.fc, • • ■ , ^r.k} as the set of lag matrices. Note that the standard HMM 
with Gaussian emissions arises as a special case of this model when A^. = for all k. 

2.2. Relating Multiple Time Series 

In our applications of interest, we are faced with a collection of N time series representing 
realizations of related dynamical phenomena. We assume that each time series is individu- 
ally modeled via a switching VAR process, as in Equation (3). Denote the VAR parameters 
for the fc*'' dynamical mode as 9k = {Ak, S^}, and assume that we have an unbounded set 
of possible VAR models {9i, 02, ■ ■ .}■ For example, these parameters might each define a 
linear motion model for the behaviors walking, running, jumping, and so on; our time series 
are then each modeled as Markov switches between these behaviors. We will sometimes 
avail ourselves the convenient shorthand of referring to k itself as a "behavior," where the 
intended meaning is the VAR model parameterized by 9k. 

The way in which our TV time series are related is by the overlap in the set of dynamic 
behaviors that each exhibits. For example, imagine that our N time series represent ob- 
servation sequences from the exercise routines of N people. We expect there to be some 
overlap in the behaviors exhibited, but also some variability — e.g., some people may solely 
switch between walking and running, while others switch between running and jumping. 

One can represent the set of behaviors available to each of the time series models with 
a list of binary features. In particular, let fi = [fn, fi2, . . ■] denote a binary feature vector 
for the i*'' time series. Setting fik = 1 implies that time series i exhibits behavior k 
for some subset of values t G {1, . . . ,Ti}, where Ti is the length of the i*^ time series. 
Our proposed featural model defines N such infinite-dimensional feature vectors, one for 
each time series. By discovering the pattern of behavior-sharing via a featural model (i.e., 
discovering fik — fjk = 1 for some i,j, k), we can interpret how the time series relate to one 
another in addition to harnessing the shared structure to pool observations from the same 
behavior, thus improving our estimate of 9k. 

2.3. Beta Processes 

Inferring the structure of behavior sharing within a Bayesian framework requires defining 
a prior on the feature inclusion probabilities. Since we want to maintain an unbounded 
set of possible behaviors (and thus require infinite-dimensional feature vectors) , we appeal 
to a Bayesian nonparametric featural model based on the beta process- Bernoulli process. 
Informally, one can think of the beta process as defining an infinite set of coin-flipping prob- 
abilities and each Bernoulli process realization is the outcome from an infinite coin-flipping 
sequence based on the beta-process-determined coin weights. The set of resulting heads 
indicate the set of selected features, and implicitly deflnes an inflnite-dimensional feature 
vector. The properties of the beta process induce sparsity in the feature space by encourag- 
ing sharing of features among the Bernoulli process realizations. The inherent conjugacy of 
the beta process to the Bernoulli process allows for an analytic predictive distribution on a 
feature vector (i.e., Bernoulli realization) based on the feature vectors observed so far (i.e., 
previous Bernoulli process draws). As outlined in Section 4.1, this predictive distribution 
can be described via the Indian buffet process under certain parameterizations of the beta 
process. 
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The Beta Process - Bernoulli Process Featural Model 

The beta process is a special case of a general class of stochastic processes known as com- 
pletely random measures (Kingman, 1967). A completely random measure B is defined such 
that for any disjoint sets Ai and A2 (of some sigma algebra ^ on a measurable space 6), the 
corresponding random measures B{Ai) and B(A2) are independent. This idea generalizes 
the family of independent increments processes on the real line. All completely random 
measures can be constructed from realizations of a nonhomogenous Poisson process (up 
to a deterministic component) (Kingman, 1967). Specifically, a Poisson rate measure 77 is 
defined on a product space (X) M, and a draw from the specified Poisson process yields a 
collection of points {9j,ujj} that can be used to define a completely random measure: 

00 

B^Y.'^kSe,. (4) 

k=l 

This construction assumes 77 has infinite mass, yielding the countably infinite collection of 
points from the Poisson process. From Eq. (4), we see that completely random measures 
are discrete. Consider a rate measure defined as the product of an arbitrary sigma-finite 
base measure Bq, with total mass Bq{Q) = a, and an improper beta distribution on the 
product space Q (E) [0, 1]: 

v{duj,de) = cuj-^{l - ijjf-^dujBti{de), (5) 

where c > is referred to as a concentration parameter. The resulting completely random 
measure is known as the beta process with draws denoted by -B ~ BP(c, -Bq) |. Note that 
using this construction, the weights oJk of the atoms in B lie in the interval (0, 1). Since rj 
is tr-finite, Campbell's theorem (Kingman, 1993) guarantees that for a finite, B has finite 
expected measure. For an example realization and its associated cumulative distribution, 
see Fig. 1. 

Note that for a base measure Bq containing atoms, a sample B ^ BP(c, Bq) necessarily 
contains each of these atoms Ok with associated weights 

ujk ^ Beta(cgfe, c(l - qt)), (6) 

where qk E (0, 1) denotes the mass of the fc*'' atom in Bq. 

The beta process is conjugate to a class of Bernoulli processes (Thibaux and Jordan, 
2007), denoted by BeP(i3), which provide our sought-for featural representation. A realiza- 
tion 

X,\B^BeP{B), (7) 

with B an atomic measure, is a collection of unit-mass atoms on Q located at some subset 
of the atoms in B. In particular, 

fik Bernoulh(a;fe) (8) 

^Letting the rate measure be defined as a product of a base measure Go and an improper 
gamma distribution rj{d6,dLo) — cp~^e~'^^dpGo{d6), with c > 0, gives rise to completely random 
measures G ~ GP(c, Go), where GP denotes a gamma process. Normalizing G yields draws from 
a Dirichlet process DP(Q,Go/a), with a — Go(6). Note that these random probability measures 
G are necessarily not completely random since the random variables G{Ai) and G(A2) for disjoint 
sets Ai and A2 are dependent due to the normalization constraint. 
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(a) (b) 

Fig. 1. (a) Top: A draw B from a beta process is shown in blue, witli tine corresponding cumulative 
distribution in red. Bottom: 50 draws Xi from a Bernoulli process using the beta process realization. 
Each blue dot corresponds to a coin-flip at that atom in B that came up heads, (b) An image of a 
feature matrix associated with a realization from an Indian buffet process with a = 10. Each row 
corresponding to a different customer, and each column a different dish. White indicates a chosen 
feature. 



is sampled independently for each atom 9^ in _B §, and then 

k 

Example realizations of Xi ^ BeP(i?), with B a draw from a beta process, are shown in 
Fig. 1(a). 

For continuous measures B, we draw L Poisson(i?(0)) and then independently sample 
a set of L atoms 9e ~ B{Q)~^B. The Bernoulli realization is then given by: 

L 

x, = Y,K- (10) 

1=1 

In our subsequent development, we interpret the atom locations 9^ as a set of global 
features that can be shared among multiple time series. A Bernoulli process realization X^ 
then determines the subset of features allocated to time series i: 

B I Bo,c^BP(c,Bo) 
X, |B^BcP(B), i = \,...,N. (11) 

Computationally, Bernoulli process realizations Xi are often summarized by an infinite 
vector of binary indicator variables — [fa, /i2i • • -li where fik — 1 if and only if time series i 
exhibits feature k. Using the beta process measure B to tie together the feature vectors 
encourages them to share similar features while allowing time-series-specific variability. 

§One can visualize this process as walking along the atoms of a discrete measure B and, at each 
atom 6k , flipping a coin with probability of heads given by uuk ■ 
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3. Describing IVIultiple Time Series with Beta Processes 

We employ the beta process featural model of Section 2.3 to define a prior on the collection 
of m/inite-dimensional feature vectors = [fn, fi2,-- •] used to describe the relationship 
amongst our N time series. Recall from Section 2.2 that the globally-shared parameters 6^ 
define the possible befiaviors (e.g., VAR processes), while the feature vector indicates the 
behaviors exhibited by time series i. 



Beta Process Prior on Features 

In our scenario, the beta process hierarchy of Equation (11) can be interpreted as follows. 
The random measure B ~ BP(c, Sq) defines a set of weights on the global collection of 
behaviors. Then, each time series i is associated with a draw from a Bernoulli process, 
Xi \ B ^ BeP(i?). The Bernoulli process realization Xi — fik^Sk implicitly defines the 
feature vector /j for time series i, indicating which set of globally-shared behaviors that 
time series has selected. Such a featural model seeks to allow for infinitely many possible 
behaviors, while encouraging a sparse, finite representation and fiexible sharing of behaviors 
between time series. For example, the lower subfigure in Fig. 1(a) illustrates a collection of 
feature vectors drawn from this process. 

Conditioned on the set of N feature vectors /j, i — 1, . . . ,N drawn from the hierarchy 
of Equation (11), the model reduces to a collection of N switching VAR processes, each 
defined on the finite state space formed by the set of selected behaviors for that time series. 
In the following section, we define the generative process for the Markov dynamics based 
on the sampled feature vectors. 



Feature-Constrained Transition Distributions 

Given /j, the i*'' time series's Markov transitions among its set of dynamic behaviors are 
governed by a set of feature- constrained transition distributions tt^'^ = {""fc ''}■ I^i particular, 
motivated by the fact that Dirichlet-distributed probability mass functions can be generated 
via normalized gamma random variables, for each time series i we define a doubly infinite 
collection of random variables: 



'7lfc I 7, K - Gamma(7 -I- KS{j, k), 1), 



(12) 



Here, 5{j, k) indicates the Kronecker delta function. Using this collection of transition 
variables, denoted by 17*^*', one can define time-series-specific, feature-constrained transition 
distributions: 



(i) 



f^ 



k\f,k 



(13) 



where igi denotes the element-wise, or Hadamard, vector product. This construction defines 
Trj'-* over the full set of positive integers, but assigns positive mass only at indices k where 
fik = 1, thus constraining time series i to solely transition amongst the dynamical behaviors 
indicated by its feature vector 

The preceding generative process can be equivalently represented via a sample tt^*'' from 
a finite Dirichlet distribution of dimension Ki = J^k fiki containing the non-zero entries of 
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T^j'^ I /j,7, '«^Dir([7, ...,7,7 + K,7, ...7]). (14) 

The K hyperparameter places extra expected mass on the component of irj*'' correspond- 
ing to a self-transition tt^^j , analogously to the sticky hyperparameter of the sticky HDP- 
HMM (Fox et al., 2011b). We also use the representation 

Trf I /,;,7,K^Dir([7,...,7,7-f K,7,...]®/,), (15) 



implying tt^^^ = 7r''V tt^^ . . . i with only a finite number of non-zero entries ttJJ^. This 



3 



.(0 J^) 



jk- 



representation is really an abuse of notation since the Dirichlet distribution is not defined for 
infinitely many parameters. In reality, we are simply examining a if^ -dimensional Dirichlet 
distribution as in Eq. (14). However, the notation of Eq. (15) is useful in reminding the 
reader that the indices of tt^*'' defined by Eq. (14) are not over 1 to Ki, but rather over 
the Ki values of k such that fik = 1. Additionally, this notation is useful for concise 
representations of the posterior distribution. 



VAR Likelihoods 

Although the methodology described thus far applies equally well to HMMs and other 
Markov switching processes, henceforth we focus our attention on the AR-HMM and develop 
the full model specification and inference procedures needed to treat our motivating example 
of visual motion capture. Specifically, let yj*"* represent the observed value of the i*'* time 
series at time t, and let z^'^ denote the latent dynamical mode. Assuming an order-r 
AR-HMM, we have 

where ^ (fc) ~ M{0, Efc), ^fc = [Ai,k ■ ■ ■ Ar,k] , and ^ = [y^!^i . . . y^!^"] ^. Re- 
call that each of the behaviors 6k = {A^, Sfe} defines a different VAR(r) dynamical mode 
and the feature-constrained transition distributions tt*^*) restrict time series i to select among 
dynamic behaviors (indexed at time t by z^''') that were picked out by its feature vector f^. 
Our beta-process-based featural model couples the dynamic behaviors exhibited by different 
time series. 



Prior on VAR Parameters 

To complete the Bayesian model specification, a conjugate matrix-normal inverse- Wishart 
(MNIW) prior (cf.. West and Harrison (1997)) is placed on the shared collection of dynamic 
parameters 9k — {A^, Sj,}. Specifically, this prior is comprised of an inverse Wishart prior 
on Sfe and (conditionally) a matrix normal prior on Ak'. 



Sfe I no, So ^ IW(no,S'o) 
Ak I J:k,M,K ^ MJ^{Ak;M,J:k,K) , 



(17) 




Fig. 2. Graphical model of the BP-AR-HMM. The beta process distributed measure 
_B I Bo ~ BP(l,Bo) is represented by its masses ujk and locations 6^, as in Eq. (4). The features 
are then conditionally independent draws fik \ o^k ~ Bernoulli(wi,), and are used to define feature- 
constrained transition distributions tt]*' | /j,7, k ~ Dir([7, . . . ,7,7 + k, 7, . . . ] ® /,). The switching 
VAR dynamics are as in Eq. (16). 



with no the degrees of freedom, Sa the scale matrix, M the mean dynamic matrix, and K 
a covariance matrix that together with defines the covariance of Ak- This prior defines 
the base measure up to the total mass parameter a, which has to be separately assigned. 
As motivated in Section 4.4, this latter parameter is given a gamma prior. 

Since the library of possible dynamic parameters is shared by all time series, posterior 
inference of each parameter set Ok relies on pooling data amongst the time series that have 
fik = 1. It is through this pooling of data that one may achieve more robust parameter 
estimates than from considering each time series individually. 



The BP-AR-HMM 

We term the resulting model the BP-autoregressive-HMM (BP-AR-HMM), with a graph- 
ical model representation presented in Fig. 2. Considering the feature space (i.e., set of 
autoregressive parameters) and the temporal dynamics (i.e., set of transition distributions) 
as separate dimensions, one can think of the BP-AR-HMM as a spatio-temporal process 
comprised of a (continuous) beta process in space and discrete-time Markovian dynamics 
in time. The overall model specification is summarized as: 

B\Bo^BP{l,Bo) 
X,\B^BcP{B), i^l,...,N 

'^f I fiil.i^^ Dir([7, . . . ,7,7 K, 7, ...]«) /,), f = 1,. . . ,iV, j = 1,2,. . . ^^^^ 

, z = l,...,iV, i = l,...,T, 

y« = A^(.,y« + e«(z«), z = 1, . . . ,iV, i = 1, . . . ,T,. 
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4. MCMC Posterior Computations 

In this section, we develop an MCMC method which ahernates between samphng binary 
feature assignments given observations and dynamic parameters, and samphng dynamic pa- 
rameters given observations and features. The sampler interleaves Metropolis-Hastings and 
Gibbs sampling updates, which are sometimes simplified by appropriate auxiliary variables. 
We leverage the fact that fixed feature assignments instantiate a set of finite AR-HMMs, 
for which dynamic programming can be used to efficiently compute marginal likelihoods. 
Computationally, sampling the potentially infinite set of time-series-specific features in our 
beta process featural model relies on a predictive distribution on features that can be de- 
scribed via the Indian buffet process (IBP) (Griffiths and Ghahramani, 2005). The details 
of the IBP are outlined below. As a key component of our feature-sampling, we introduce a 
new approach employing incremental "birth" and "death" proposals, improving on previous 
exact samplers for IBP models in the non-conjugate case (Meeds et al., 2007). 

4. 1. Background: The Indian Buffet Process 

As shown by Thibaux and Jordan (2007), marginalizing over the latent beta process B in 
the hierarchical model of Equation (11), and taking c = 1, induces a predictive distribution 
on feature indicators known as the Indian buffet process (IBP) (Griffiths and Ghahramani, 
2005). The IBP is based on a culinary metaphor in which customers arrive at an infinitely 
long buffet line of dishes, or features [behaviors in our case). The first arriving customer, 
or time series in our case, chooses Poisson(a) dishes. Each subsequent customer i selects 
a previously tasted dish k with probability nik/i proportional to the number of previous 
customers to sample it, and also samples Poisson(Q;/i) new dishes. The feature matrix 
associated with a realization from an Indian buffet process is shown in Fig. 1(b). 

To derive the IBP from the beta process formulation described above, we note that 
the probability Xi contains feature Ok after having observed Xi, . . . is equal to the 

expected mass of that atom: 

p(/,fc = 1 I Xl,...,X,.l)=EslXu...,X,-APif^k = 1 I B)]^EBlXu....X,-A^k], (19) 

where our notation Eb[-] means to take the expectation with respect to the distribution 
of B. Because beta process priors are conjugate to the Bernoulli process (Kim, 1999), 
the posterior distribution given samples Xi ~ BeP(i3) is a beta process with updated 
parameters: 

B\X,....,X„,B„,c~BpL + N, B„ + Jf.) (20) 

-Bp(c + iV.^B. + |^.,.), (21) 

Here, ruk denotes the number of time series Xi that select the k*^ feature 9k (i.e., fik — 1). 
For simplicity, we have reordered the feature indices to list first the features used by at 
least one time series. 

Using the posterior distribution defined in Eq. (21), we consider the discrete and con- 
tinuous portions of the base measure separately. The discrete component is a collection of 
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atoms at locations 9i, . . . , 0k^ , each with weight 

where is the number of unique atoms present in Xi, . . . , Xi_i. For each of the currently 
instantiated features /c G {1, . . . , K^}, we have 

UJk Beta((c + i - l)qk, (c + i - 1)(1 - g^)) (23) 

such that the expected weight is simply qk, implying that the i*'' time series chooses one 
of the currently instantiated features with probability proportional to the number of time 
series that already chose that feature, rrik- We now consider the continuous portion of the 
base measure, 

Bo- (24) 



The Poisson process defined by this rate function generates 

Poisson ^— pi — -BoiO)^ = Poisson ^— p| — -a^ (25) 

new atoms in Xi that do not appear in Xi, . . . ,Xi-i. Following this argument, the first 
time series simply chooses Poisson(a) features. If we specialize this process to c = 1, we 
arrive at the IBP. 



4.2. Sampling binary feature assignments 

Let F^^'^ denote the set of all binary feature indicators excluding fik, and K^"^ be the 
number of behaviors used by all of the other time series For notational simplicity, 
we assume that these behaviors are indexed by {1, . . . ,K^^}. The IBP prior differentiates 
between features, or behaviors, that other time series have already selected and those unique 
to the current time series. Thus, we examine each of these cases separately. 



Shared features 

Given the time series y[*^. , transition variables jy*^*^ = ^i^k^' i ^^'^ shared dynamic 

parameters d^.j^-i, the feature indicators fik for currently used features k G {1, . . . , K^^} 
have the following posterior distribution: 

p{f,k I F-*^y« ,r,W,0,^^-.,a) (xp(/.fc | F-'V)p(y« | /„ r,«, 0,^^-0- (26) 



Here, the IBP prior described in Section 2.3 implies that p{fik = 1 | a) = m^^/N, 

where denotes the number of time series other than time series i that exhibit behavior 
k. In evaluating this expression, we have exploited the exchangeability of the IBP (Griffiths 
and Ghahramani, 2005), which follows directly from the beta process construction (Thibaux 

^Some of the K^^ features may also be used by time series i, but only those not unique to that 
time series. 
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and Jordan, 2007). For binary random variables, Metropolis-Hastings proposals can mix 
faster (Frigcssi et al., 1993) and have greater efficiency (Liu, 1996) than standard Gibbs 
samplers. To update fik given F~''^, we thus use the posterior of Eq. (26) to evaluate a 
Metropolis-Hastings proposal which flips fn- to the complement / of its current value /: 

u - p{f I mu, I) + (1 - p{i I imhuj) 

M/.fc = /|F-^^y«^,,7W,0,^^-.,a)' ] 

To compute likelihoods, we combine and ry*^'-* to construct feature-constrained transition 
distributions tt^*'* as in Eq. (13), and marginalize over the exponentially large set of possible 
latent mode sequences by applying a variant of the sum-product message passing algorithm 
for AR-HMMs. (See Appendix A.) 



P{f I /) = mill 



Unique features 

An alternative approach is needed to sample the Poisson(a/A^) "unique" features associated 
only with time series i. Let = K^^ + n^, where rii is the number of unique features 
chosen, and define = i k^^ ^^'^ f+i ~ fi k~'+i-k+- posterior distribution over 
Ui is then given by 

/ I J! («) (i) n \ (]v)"'^ " 

P[nt I /i,yi:T,:»7^ ^^'l:i^-,a) « 1 

+ Ti l . 

P{y%^ I = l,r,W,r7+,0,^^-.,0+) dBo{e+)dH{rj^), (28) 

where H is the gamma prior on transition variables 77^^, , and we recall that -Bo is the base 
measure of the beta process. The set 0+ = ^/^-i+i./f^ consists of the parameters of unique 

features, and t]_^ the transition parameters ryj^' to or from unique features j, k e {K^^ + 1 : 
-fT+j. Exact evaluation of this integral is intractable due to dependencies induced by the 
AR-HMMs. 

One early approach to approximate Gibbs sampling in non-conjugate IBP models relies 
on a finite truncation of the limiting Bernoulli process (Goriir et al., 2006). That is, drawing 
Hi ^ Poisson(a/iV) distribution is equivalent to setting equal to the number of successes 
in infinitely many Bernoulli trials, each with probability of success 

li-i (29) 

K^oo a/K + N ^ ' 

Goriir et al. (2006) truncate this process and instead consider K* Bernoulli trials with prob- 
ability {a/ K*)/ {a/ K* + N). Meeds et al. (2007) instead consider independent Metropolis 
proposals which replace the existing unique features by ~ Poisson(a/-/V) new features, 
with corresponding parameters 0+ drawn from the prior. For high-dimensional models such 
as those considered in this paper, however, such moves have extremely low acceptance rates. 

We instead develop a birth and death reversible jump MCMC sampler (Green, 1995), 
which proposes to either add a single new feature, or eliminate one of the existing features 
in Our proposal distribution factors as follows: 

g(/;„0'+- V+ I = I f+i)<le{0\ I 0+)g,(r7'+ | »7+) (30) 
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Let rii ~ J2kf+ik- The feature proposal g/(- | •) encodes the probabiUties of birth and death 
moves, which we set as follows: A new feature is created with probability 0.5, and each of 
the Tii existing features is deleted with probability 0.5 /ui. This set of possible proposals 
leads to considering transitions from to n'^ unique features, with n'^ — rii + I in the case 
of a birth proposal, or = — 1 in the case of a proposed feature death. Note that if the 
proposal from the distribution defined in Eq. (30) is rejected, we maintain n'^ = rii unique 
features. For parameters, we define our proposal using the generative model: 

n (a' \ f' f n _ ( bo{O+,n,+i)UkU^e+.ki0+,k)^ birth of feature + 1; 

qe(tf+ I «+j - I H^^^Se^je'^^,), death of feature £. ^"^^^ 

That is, for a birth proposal, a new parameter 0^ is drawn from the prior and all other 
parameters remain the same. For a death proposal of feature j, we simply eliminate that 
parameter from the model. Here, bo is the density associated with a~^BQ. The distribution 
g,;(- I •) is defined similarly, but using the gamma prior on transition variables of Eq. (12). 
The Metropolis-Hastings acceptance probability is then given by 

p{fU,d'+,v'+ I /+„0+,r7+) =min{r(/;„0V,r,; | 0+, ,7+), 1}. (32) 

As derived in Appendix B, we compactly represent the acceptance ratio r(- | •) for either a 
birth or death move as 



P(yl:T. I [f-^fU(^i--K^,0+,V^''\rj'+) Poisson(n', | a/N) gy(/^, | /;,) 
P{y[%, I [/-./+J,0i:K+,r,W) Poisson(n, | a/N) qf{f^, \ /+,) 



(33) 



where we recall that = '^kf'+ik- Because our birth and death proposals do not modify 
the values of existing parameters, the Jacobian term normally arising in reversible jump 
MCMC algorithms simply equals one. 



4.3. Sampling dynamic parameters and transition variables 

Posterior updates to transition variables 77*^*^ and shared dynamic parameters 9k are greatly 
simplified if we instantiate the mode sequences zj.^. for each time series i. We treat these 
mode sequences as auxiliary variables that are discarded for subsequent updates of feature 
assignments /j. 



Mode sequences zj*^. 

Given feature-constrained transition distributions tt^*^ and dynamic parameters {Ok}, along 
with the observation sequence y\.rp. , we block sample the mode sequence z\.j,, by computing 
backward messages mt+i.t{z[^^) cx viv^t+i-Ti I ^\^\y\-\'^^^\{^k})-, and then recursively 
sampling each z^*-*: 

z« I z«i,y«^,7rW,{0a ^ (zf))AA(y(^); A^<,yf\s^,,)m,+i,(z«). (34) 
This backward message-passing, forward-sampling scheme is detailed in Appendix A. 
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(i) 

Transition distributions tt^ 

(i) 

We use the fact that Dirichlet priors are conjugate to multinomial observations z}.^ to 
derive the posterior of tt^*'' as 

nf I f„ z|:^,7, n ^ Dh([7 + n^l . . . ,7 + n^_„J + k + n«,7 + n^+„ • ■ •] ® /,)■ (35) 

(i) . . . (i) 

Here, n^^; are the number of transitions from mode j to k in zl.rp. Since the mode sequence 
z[^}p was generated from feature-constrained transition distributions, will be zero for 
any k such that fik = 0. Using the definition of vrj''' in Eq. (13), one can equivalently define 
a sample from the posterior of Eq. (35) by solely updating ryj^ for instantiated features: 

vS I 4%^ 7, « Gamma(7 + i^SU, k) + ng, 1), k^{£\fa^ 1}. (36) 



Dynamic parameters {^4^, Sfc} 

We now turn to posterior updates for dynamic parameters. Recall the conjugate matrix 
normal inverse- Wishart (MNIW) prior on {A^, S^}, comprised of an inverse- Wishart prior 
IW(no,S'o) on and a matrix- normal prior MM {Ak;M,TikiK) on Ak given Sfc. We 
consider the following sufficient statistics based on the sets Y)j = {y^*^ | z'f^ = k, i = 
1, . . . , N} and Y^. — {y'f'' \ z^f' = fc, z = 1, . . . , A^} of observations and lagged observations, 
respectively, associated with behavior fc: 

= 1^ YtYt +K sy = 2^ yry* +MK 

(t,j)lz<''=fe 

It is through this pooling of data from multiple time series that we improve our inferences 
on shared behaviors, especially in the presence of limited data. Using standard MNIW 
conjugacy results, the posterior can be shown to equal 

m ^ iw (im + no, + 



4.4. Sampling the BP and Dirichlet transition hyperparameters 

We additionally place priors on the Dirichlet hyperparameters 7 and k, as well as the BP 
parameter a. 



BP hyperparameter a 

Let F = {fi}. As derived by Griffiths and Ghahramani (2005), p{F \ a) can be expressed 
as 

p{F I a) cx exp ( - a E " ) ' (^^) 

\ n=l ^ ' 
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where, as before, K+ is the number of unique features activated in F. As in Goriir et al. 
(2006), we place a conjugate Gamnia(aQ,, 6q,) prior on a, which leads to the following pos- 
terior distribution: 

/ 1 

p{a I F,aa, ba) oc exp —a \^ — 

\ n—l 

= Ganima( + K+,ba + X/ ^ ) ("^^^ 

^ n=l 



exp(— 6c 



a 



T{a) 



Transition hyperparameters 7 and k 

Transition hyperparameters are assigned priors 7 Gamma(a^, bj) and k ~ Gamma(aK, &«). 
Because the generative process of Eq. (12) is non-conjugate, we rely on Metropolis-Hastings 
steps which iteratively sample 7 given k, and k given 7. Each sub-step uses a gamma pro- 
posal distribution g^(- | •) or qi^(- \ •), respectively, with fixed variance cr^ or ct^, and mean 
equal to the current hyperparameter value. 

As derived in Appendix C, the acceptance ratio for for the proposal of 7 given k, is 



where = 7^/(7^, -d' = 7'^/iT^, and f{j) is the likelihood term. Specifically, letting tt = 

{tt^*^} and recalling the definition of 7rj'^ from Eq. (14) and that Ki — J^k fik, the likelihood 
term may be written as 

/h, . p(. I . F, . n n I ^ , ft *i:f 1 ^ m 

» fe=i [ (11^=1 r(7) j r(7 + k) J 

The Metropolis-Hastings sub-step for sampling k given 7 follows similarly. In this case, 
however, the likelihood terms simplifies to 



/(») ^ n '^v^ n *5f « I n (43) 



. r(7 + K)''. 

The resulting MCMC sampler for the BP-AR-HMM is summarized in Algorithm 1 of Ap- 
pendix D. 



5. Related Work 

A challenging problem in deploying Markov switching processes such as the AR-HMM 
is that of defining the number of dynamic regimes. Previously, Bayesian nonparametric 
approaches building on the hierarchical Dirichlet process (HDP) (Teh et al., 2006) have 
been proposed to allow uncertainty in the number of regimes by defining Markov switching 
processes on infinite state spaces (Beal et al., 2002; Teh et al., 2006; Fox et al., 2011b, a). 
See Fox et al. (2010a) for a recent review. However, these formulations focus on a single time 
series whereas in this paper our motivation is analyzing a collection of time series. A naive 
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approach to employing such models in the multiple time series setting is to simply couple 
each of the time series under a shared HDP prior. However, such an approach assumes 
that the state spaces of the multiple Markov switching processes are exactly shared, as are 
the transitions among these states (i.e., both the transition and emissions parameters are 
global.) As demonstrated in Section 6 and Section 7, such strict sharing can limit the ability 
to discover unique dynamic behaviors and reduce the predictive performance of the inferred 
model. 

In recent independent work, Saria ct al. (2010) developed an alternative approach to 
modeling multiple time series via the HDP-HMM. Their time series topic model (TSTM) 
describes coarse-scale temporal behavior using a finite set of "topics" , which are themselves 
distributions on a common set of autoregressive dynamical models. Each time series is 
assumed to exhibit all topics to some extent, but with unique frequencies and temporal 
patterns. Alternatively, the mixed HMM (Altman, 2007) uses generalized linear models to 
allow the state transition and emission distributions of a finite HMM to depend on arbitrary 
external covariates. In experiments, this is used to model the differing temporal dynamics 
of a small set of known time series classes. 

More broadly, the specific problem we address here has received little previous attention, 
perhaps due to the difficulty of treating such combinatorial relationships with parametric 
models. There are a wide variety of models which capture correlations among multiple 
aligned, interacting univariate time series, for example using Gaussian state space mod- 
els (Aoki and Havenner, 1991). Other approaches cluster time series using a parametric 
mixture model (Alon et al., 2003), or a Dirichlet process mixture (Qi et al., 2007), and 
model the dynamics within each cluster via independent finite HMMs. 

Dynamic Bayesian networks (Murphy, 2002), such as the factorial HMM (Ghahramani 
and Jordan, 1997), define a structured representation for the latent states underlying a 
single time series. Such models are widely used in applied time series analysis (Lchrach 
and Husmcicr, 2009; Duh, 2005). The infinite factorial HMM (Van Gael et al, 2009) uses 
the IBP to model a single time series via an infinite set of latent features, each evolving 
according to independent Markovian dynamics. Our work instead focuses on modeling 
multiple time series and on capturing dynamical modes that are shared among the series. 

Other approaches do not explicitly model latent temporal dynamics, and instead aim to 
align time series with consistent global structure (Aach and Church, 2001). Motivated by 
the problem of detecting temporal anomalies, Listgarten et al. (2007) describe a hierarchical 
Bayesian approach to modeling shared structure among a known set of time series classes. 
Independent HMMs are used to encode non-linear alignments of observed signal traces to 
latent reference time series, but their states do not represent dynamic behaviors and are 
not shared among time series. 

6. Synthetic Experiments 

6. 1. Discovering Common Dynamics 

To test the ability of the BP-AR-HMM to discover shared dynamics, we generated five time 
series that switched between AR(1) models: 

Vt - a. (^)yt__i + (Zj ), (44) 

with Ok e {—0.8, —0.6, —0.4, —0.2, 0, 0.2, 0.4, 0.6, 0.8} and process noise covariance Sfc drawn 
from an IW(3, 0.5) prior. The time-series-specific features, shown in Fig. 3(b), were sampled 
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Fig. 3. (a) Observation sequences for each of 5 switching AR(1) time series colored by true mode 
sequence, and offset for clarity. Images of the (b) true feature matrix of the five time series and 
(c) estimated feature matrix averaged over 10,000 MCMC samples taken from 100 trials every 10th 
sample. Each row corresponds to a different time series, and each column a different autoregressive 
model. White indicates active features. Although the true model is defined by only 9 possible dynam- 
ical modes, we show 20 columns in order to display the "tail" of the BP-AR-HMM estimated matrix 
resulting from samples that incorporated additional dynamical modes (events that have positive prob- 
ability of occurring, as defined by the IBP prior.) The estimated feature matrices are produced from 
mode sequences mapped to the ground truth labels according to the minimum Hamming distance 
metric, and selecting modes with more than 2% of the observations in a time series. 



from a truncated IBP (Griffiths and Ghahramani, 2005) using a = 10 and then used to 
generate the observation sequences of Fig. 3(a) (colored by the true mode sequences). Each 
row of the feature matrix corresponds to one of the five time series, and the columns 
represent the different autoregressive models with a white square indicating that a given 
time series uses that dynamical mode. Here, the columns are ordered so that the first feature 
corresponds to an autoregressive model defined by oi, and the ninth feature corresponds to 
that of og. 

The resulting feature matrix estimated over 10,000 MCMC samples is shown in Fig. 3(c). 
Each of the 10,000 estimated feature matrices is produced from an MCMC sample of the 
mode sequences that are first mapped to the ground truth labels according to the minimum 
Hamming distance metric. We then only maintain inferred dynamical modes with more 
than 2% of the time series's observations. Comparing to the true feature matrix, we see 
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that our model is indeed able to discover most of the underlying latent structure of the 
time series despite the challenges caused by the fact that the autoregressive coefficients are 
close in value. The most commonly missed feature occurrence is the use of 04 by the fifth 
time series. This fifth time series is the top-most displayed in Fig. 3(a), and the dynamical 
mode defined by 04 is shown in green. We see that this mode is used very infrequently, 
making it challenging to distinguish. Due to the nonparametric nature of the model, we also 
see a "tail" in the estimated matrix because of the (infrequent) incorporation of additional 
dynamical modes. 

6.2. Comparing the Feature- Based Model to Nonparametric Models with Identical State 
Spaces 

One might propose, as an alternative to the BP-AR-HMM, the use of an architecture based 
on the hierarchical Dirichlet process of Teh ct al. (2006) ; specifically we could use the HDP- 
AR-HMMs of Fox et al. (2011a) tied together with a shared set of transition and dynamic 
parameters. For an HDP-AR-HMM truncated to L possible dynamical modes, this model 
is specified as: 

/3^Dir(7/L,...,7/L) 
TTj \ (3 ^ Dir(Q!^i, . . . , af3j_i,a/3j + k, a/3j+i, . . . , a/?^) 

Here, a and 7 are a set of concentration parameters that define the HDP and k is the sticky 
hyperparameter of the sticky HDP-HMM (Fox et al., 2011b); these hyperparameters are 
often given priors as well. 



Segmentation Performance 
To demonstrate the diflFerence between this HDP-AR-HMM and the BP-AR-HMM, we 
generated data for three switching AR(1) processes. The first two time series, with four 
times the data points of the third, switched between dynamical modes defined by ak € 
{—0.8, —0.4, 0.8} and the third time series used G {—0.3, 0.8}. The results shown in 
Fig. 4 indicate that the multiple HDP-AR-HMM model, which assumes all time series 
share exactly the same transition matrices and dynamic parameters, typically describes the 
third time series using Ofc G {—0.4, 0.8} since this assignment better matches the parameters 
defined by the other (lengthy) time series. This common grouping of two distinct dynamical 
modes leads to the large median and 90th Hamming distance quantiles shown in Fig. 4(b). 
The BP-AR-HMM, on the other hand, is better able to distinguish these dynamical modes 
(see Fig. 4(c)) since the penalty in not sharing a behavior is only in the feature matrix; 
once a unique feature is chosen, it does not matter how the time series chooses to use it. 
Example segmentations representative of the median Hamming distance error are shown in 
Fig. 4(d)- (e). These results illustrate that the IBP-based feature model emphasizes choosing 
behaviors rather than assuming all time series are performing minor variations of the same 
dynamics. 

For the experiments above, we placed a Gamma(l, 1) prior on a and 7, and a Gamma(100, 1) 
prior on k. The gamma proposals used = 1 and cr^ = 100 while the MNIW prior was 
given M = 0, K — 0.1* Id, uq — d + 2, and 5*0 set to 0.75 times the empirical variance of the 
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Fig. 4. (a) Observation sequences for eacti of 3 switching AR(1) time series colored by true mode 
sequence, and offset for clarity. The first and second sequences are four times as long as the 
third, (b)-(c) Focusing solely on the third time series, the median (solid blue) and lO"' and 90"' 
quantiles (dashed red) of Hamming distance between the true and estimated mode sequence over 
1000 trials are displayed for the multiple HDP-AR-HMM model (Fox et al., 201 1a) and the BP-AR- 
HIVIM, respectively, (d)-(e) Examples of typical segmentations into behavior modes for the three 
time series at MCMC iteration 1000 for the two models. The top and bottom panels display the 
estimated and true sequences, respectively, and the color coding corresponds exactly to that of (a). 
For example, time series 3 switches between two modes colored by cyan and maroon. 
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Fig. 5. Histogram of the predictive log-lil<elihood of 100 field-out data using the inferred parameters 
sampled every 10th iteration from IViCMC iterations 5001,000 from 50 independent chains for the 
BP-AR-HIVIIVl and HDP-AR-HMM run on the data of Fig. 4(a). 

joint set of first-difference observations. At initialization, each time series was segmented 
into five contiguous blocks, with feature labels unique to that sequence. 



Predictive Performance 

Using the same data-generating mechanism as used to generate the time series displayed 
in Fig. 4(a), we generated a set of 100 held-out test datasets for Objects 1, 2, and 3. Each 
of the time series comprising the test datasets was of length 1000 (in contrast to the data 
of Fig. 4(a) in which the time series of Object 3 was of length 500 and those of Objects 
1 and 2 were of length 2000.) Based on a set of samples taken from 50 chains at MCMC 
iterations [500 : 10 : 1000] (i.e., a total of 2500 samples), we computed the log-hkelihood of 
each of the 100 held-out datasets. That is, we added the time-series-specific log-likelihoods 
computed for each time series since the time series are conditionally independent given 
the model parameters. We performed this task for both the MCMC samples of the BP- 
AR-HMM and HDP-AR-HMM. The resuhs are summarized in the histogram of Fig. 5. 

Since the BP-AR-HMM consistently identifies the unique dynamical mode of ak — —0.3 
used by Object 3 while the HDP-AR-HMM does not, we see from Fig. 5 that the mass of 
the BP-AR-HMM predictive log-likelihood is shifted positively by roughly 100 compared to 
that of the HDP-AR-HMM. In addition, we see that the histogram for the HDP-AR-HMM 
has a heavy tail, skewed towards lower log-likelihood, whereas the BP-AR-HMM does not. 

Recall a couple of key differences between the BP-AR-HMM and HDP-AR-HMM. Both 
the HDP-AR-HMM and BP-AR-HMM define global libraries of infinitely many possible 
dynamic behaviors. However, the HDP-AR-HMM assumes that each of the time series 
selects the same finite subset of behaviors and transitions between them in exactly the 
same manner (i.e., the transition matrix is also global.) On the other hand, the BP-AR- 
HMM allows each time series to select differing subsets of behaviors and differing transition 
probabilities. In the dataset examined here, the data-generating transition matrix between 
behaviors is the same for all time series, which matches the assumption of the HDP-AR- 
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HMM. Second, two of the three time series share exactly the same dynamical modes, which 
is also close to the assumed HDP-AR-HMM formulation. The only aspect of the data that 
is better modeled apriori by the BP-AR-HMM is the unique dynamical mode of Object 3. 
However, there is not a large difference between this unique dynamic of — —0.3 and the 
HDP-AR-HMM assumed = —0.4. Regardless of the fact that the data are a close fit to 
the assumptions made by the HDP-AR-HMM, the improved predictive log-likelihood of the 
BP-AR-HMM illustrates the benefits of this more flexible framework. 



7. Motion Capture Experiments 

The linear dynamical system is a common model for describing simple human motion (Hsu 
et al., 2005), and the switching linear dynamical system (SLDS) has been successfully ap- 
plied to the problem of human motion synthesis, classification, and visual tracking (Pavlovic 
et al., 1999, 2001). Other approaches develop non-linear dynamical models using Gaussian 
processes (Wang et al., 2008) or based on a collection of binary latent features (Taylor et al., 
2007). However, there has been little effort in jointly segmenting and identifying common 
dynamic behaviors amongst a set of multiple motion capture (MoCap) recordings of people 
performing various tasks. The BP-AR-HMM provides a natural way to handle this problem. 
One benefit of the proposed model, versus the standard SLDS, is that it does not rely on 
manually specifying the set of possible behaviors. As an illustrative example, we examined 
a set of six CMU MoCap exercise routines (CMU, 2009), three from Subject 13 and three 
from Subject 14. Each of these routines used some combination of the following motion 
categories: running in place, jumping jacks, arm circles, side twists, knee raises, squats, 
punching, up and down, two variants of toe touches, arch over, and a reach out stretch. 

From the set of 62 position and joint angles, we selected the following set of 12 mea- 
surements deemed most informative for the gross motor behaviors we wish to capture: one 
body torso position, two waist angles, one neck angle, one set of right and left shoulder 
angles, the right and left elbow angles, one set of right and left hip angles, and one set of 
right and left ankle angles. The CMU MoCap data are recorded at a rate of at 120 frames 
per second, and as a preprocessing step we block-average and downsample the data using a 
window size of 12. We additionally scale each component of the observation vector so that 
the empirical variance on the concatenated set of first difference measurements is equal to 
one. Using these measurements, the prior distributions were set exactly as in the synthetic 
data experiments except the scale matrix, 5*0, of the MNIW prior which was set to 5 • /12 
(i.e., five times the empirical covariance of the preprocessed first-difference observations, 
and maintaining only the diagonal.) This setting allows more variability in the observed 
behaviors. We ran 25 chains of the sampler for 20,000 iterations and then examined the 
chain whose segmentation minimized an expected Hamming distance to the set of segmen- 
tations from all chains over iterations 15,000 to 20,000. This method of selecting a sample, 
first introduced in Fox et al. (2011b), is outlined as follows. We first choose a large reference 
set TZ of state sequences produced by the MCMC sampler and a possibly smaller set of test 
sequences T- Then, for each collection of state sequences 2^"' in the test set T (with z^"^ 
being the MCMC sample of 2; = {^^j!^} at iteration n), we compute the empirical mean 
Hamming distance between the test sequence and the sequences in the reference set TZ; we 
denote this empirical mean by Hn- We then choose the test sequence 2; I" 1 that minimizes 
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Fig. 6. Each skeleton plot displays the trajectory of a learned contiguous segment of more than two 
seconds. To reduce the number of plots, we preprocessed the data to bridge segments separated 
by fewer than 300 msec. The boxes group segments categorized under the same behavior label, 
with the color indicating the true behavior label (allowing for analysis of split behaviors). Skeleton 
rendering done by modifications to Neil Lawrence's IVIatlab MoCap toolbox (Lawrence, 2009). 



this expected Hamming distance. That is, 

1 — arg min _ff„. 

zi"ier 

The empirical mean Hamming distance is a label-invariant loss function since it does 
not rely on labels remaining consistent across samples — we simply compute 

= E Hamm(z["U'"l), 

where Hamm(2;["l , 2;[™1) is the Hamming distance between sequences 2;["1 and 2;[™1 after 
finding the optimal permutation of the labels in test sequence to those in reference 
sequence 2 1™!. At a high level, this method for choosing state sequence samples aims to 
produce segmentations of the data that are typical samples from the posterior. Jasra et al. 
(2005) provides an overview of some related techniques to address the label-switching issue. 

The resulting MCMC sample is displayed in Fig. 6. Each skeleton plot depicts the 
trajectory of a learned contiguous segment of more than two seconds, and boxes group 
segments categorized under the same behavior label by our algorithm. The color of the 
box indicates the true behavior label. From this plot we can infer that although some true 
behaviors are split into two or more categories by our algorithm, the BP-AR-HMM shows 
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Fig. 7. Hamming distance versus number of GMM clusters / HMIVI states on raw observations 
(blue/green) and first-difference observations (red/cyan), witli the BP-AR-HIVIM segmentation (blacl<, 
liorizontal daslied) and true feature count (magenta, vertical dashed) shown for comparison. Results 
are for the most-likely of ten initializations of EM using an HMM Matlab toolbox (IVIurphy, 1 998). 

a clear ability to find common motions. Specifically, the BP-AR-HMM has successfully 
identified and grouped examples of jumping jacks (magenta), side twists (bright blue), 
arm circles (dark purple), squats (orange), and various motion behaviors that appeared in 
only one movie (bottom left four skeleton plots.) The split behaviors shown in green and 
yellow correspond to the true motion categories of knee raises and running, respectively, and 
the splits can be attributed to the two subjects performing the same motion in a distinct 
manner. For the knee raises, one subject performed the exercise while slightly twisting the 
upper in a counter-motion to the raised knee (top three examples) while the other subject 
had significant sidc-to-side upper body motion (middle three examples). For the running 
motion category, the splits also tended to correspond to varying upper body motion such 
as running with hands in or out of sync with knees. One example (bottom right) was the 
subject performing a lower-body run partially mixed with an upper-body jumping jack/arm 
flapping motion (an obviously confused test subject.) See Section 8 for further discussion 
of the BP-AR-HMM splitting phenomenon. 

We compare our MoCap performance to the Gaussian mixture model (GMM) method 
of Barbie et al. (2004) using expectation maximization (EM) initialized with k-means. Barbie 
et al. (2004) also present an approach based on probabilistic principal component analysis 
(PC A), but this method focuses primarily on change-point detection rather than behavior 
clustering. As further comparisons, we consider a GMM on first-difference observations, and 
an HMM on both data sets. In Fig. 7, we analyze the ability of the BP-AR-HMM, as com- 
pared to the defined GMMs and HMMs, in providing accurate labelings of the individual 
frames of the six movie clips |j . Specifically, we plot the Hamming distance between the true 
and estimated frame labels versus the number of GMM clusters and HMM states, using the 
most-likely of ten initializations of EM. We also plot the Hamming distance corresponding 
the BP-AR-HMM MCMC sample depicted in Fig. 6, demonstrating that the BP-AR-HMM 

II The ability to accurately label the frames of a large set of movies is useful for tasks such as 
querying an extensive MoCap database (such as that of CMU) without relying on manual labeling 
of the movies. 
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Fig. 8. Feature matrices associated with the true MoCap sequences (top-left), BP-AR-HIVIM esti- 
mated sequences over iterations 15,000 to 20,000 (top-right), and IVIAP assignment of the GIVIIVI 
(bottom-left) and HMM (bottom-right) using first-difference observations and 12 clusters/states. 

provides more accurate frame labels than any of these alternative approaches over a wide 
range of mixture model settings. The estimated feature matrices for the BP-AR-HMM 
and the GMM and HMM on first difference observations are shown in Fig. 8. The figure 
displays the matrix associated with the MAP label estimate in the case of the GMM and 
HMM, and an estimate based on MCMC samples from iterations 15,000 to 20,000 for the 
BP-AR-HMM. For the GMM and HMM, we consider the case when the number of Gaussian 
mixture components or the number of HMM states is set to the true number of behaviors, 
namely 12. By pooling all of the data, the GMM and HMM approaches assume that each 
time series exhibits the same structure; the results of this assumption can be seen in the 
strong bands of white implying sharing of behavior between the time series. The feature 
matrix estimated by the BP-AR-HMM, on the other hand, provides a much better match 
to the true matrix by allowing for sequence-specific variability. For example, this ability is 
indicated by the special structure of features in the upper right portion of the true feature 
matrix that is mostly captured in the BP-AR-HMM estimated feature matrix, but is not 
present in those of the GMM or HMM. We do, however, note a few BP-AR-HMM merged 
and split behaviors. Overall, we see that in addition to producing more accurate segmen- 
tations of the MoCap data, the BP-AR-HMM provides a superior ability to discover the 
shared feature structure. 



8. Discussion 

We have presented a Bayesian nonparametric framework for discovering dynamical modes 
common to multiple time series. Our formulation reposes on the beta process, which pro- 
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vides a prior distribution on overlapping subsets of binary features. This prior allows both 
for commonality and time-series-specific variability in the use of dynamical modes. We 
additionally developed a novel exact sampling algorithm for non-conjugate IBP models. 
The utility of our BP-AR-HMM was demonstrated both on synthetic data, and on a set 
of MoCap sequences where we showed performance exceeding that of alternative methods. 
Although we focused on switching VAR processes, our approach could be equally well ap- 
plied to HMMs, and to a wide range of other segmented dynamical systems models such as 
switching linear dynamic systems. 

The idea proposed herein of a feature-based approach to relating multiple time series 
is not limited to nonparametric modeling. One could just as easily employ these ideas 
within a parametric model that pre-specifies the number of possible dynamic behaviors. 
We emphasize, however, that conditioned on the infinite feature vectors of our BP-AR- 
HMM, our model reduces to a collection of Markov switching processes on a finite state 
space. The beta process simply allows for flexibility in the overall number of globally shared 
behaviors, and computationally we do not rely on any truncations of this infinite model. 

One area of future work is to develop split-merge proposals to further improve mixing 
rates for high-dimensional data. Although the block initialization of the time series helps 
with the issue of splitting merged behaviors, it does not fully solve the problem and cannot 
be relied upon in datasets with more irregular switching patterns than the MoCap data we 
considered. Additionally, splitting a single true behavior into multiple estimated behaviors 
often occurred. The root of the splitting issue is two-fold. One is due to the mixing rate 
of the sampler. The second, unlike in the case of merging behaviors, is due to modeling 
issues. Our model assumes that the dynamic behavior parameters (i.e., the VAR process 
parameters) are identical between time series and do not change over time. This assumption 
can be problematic in grouping related dynamic behaviors, and might be addressed via 
hierarchical models of behaviors or by ideas similar to those of the dependent Dirchlet 
process (MacEachern, 1998; Griffin and Steel, 2006) that allows for time- varying parameters. 

Overall, the MoCap results appeared to be fairly robust to examples of only slightly 
dissimilar behaviors (e.g., squatting to different levels, twisting at different rates, etc.) 
However, in cases such as the running motion where only portions of the body moved in 
the same way while others did not, we tended to split the behavior group. This observation 
motivates examination of local partition processes (Dunson, 2009, 2010) rather than global 
partition processes. That is, our current model assumes that the grouping of observations 
into behavior categories occurs along all components of the observation vector rather than 
just a portion (e.g., lower body measurements.) Allowing for greater flexibility in the 
grouping of observation vectors becomes increasingly important in high dimensions. 
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A. Appendix A: Sum-Product Algorithm for thie AR-IHIUIIVI 

A variant of the sum-product algorithm apphed specificahy to the chain graph of the AR- 
HMM provides an efficient method for computing the Ukehhood of the data marginahzing 
the latent HMM mode sequence. For the BP-AR-HMM of this paper, we compute the 
likelihood of each time series based on a fixed feature matrix assignment, which reduces the 
joint model to a finite collection of finite-dimensional AR-HMMs, each of which is described 
by its set of feature-constrained transition distributions along with the shared library of VAR 
parameters 0k = {Ak, Sfc}. The derivations provided in this appendix directly follow those 
for the standard HMM (Rabiner, 1989). First, we define a set of forward messages 

at{zt)=p{yi,---,yt,zt), (46) 

which satisfy the recursion 

at+i{zt+i) =p(yt+i I zt+uyt+i)^p{yi,...,yt I zt)p{zt+i \ zt)p{zt) (47) 

Zt 

= p{yt+i I zt+i,yt+i)^at{zt)p(zt+i I Zt) (48) 
= 7V(?/t+i; ) ^ at{zt)n^, {zt+i)- (49) 

The messages are initialized as 

ai{zi) =p(yi,yi,zi) =7V(yi;^3iyi,I]^j7ro(zi). (50) 

After running the recursion from t = 1, . . . , T, the desired likelihood is simply computed 
by summing over the components of the forward message at time T: 

p{yi,---,yT) =^(^t{zt)- (51) 

Note that for the BP-AR-HMM, at each step the forward message for time series i is 
computed by summing z^^^ over the finite collection of possible HMM mode indices specified 
by that time series's feature vector f^. 



B. Appendix B: Acceptance Ratio for Birth-Death Proposal 

Let us first consider a birth move in which we propose a transition from to -|- 1 unique 
features for time series i. As dictated by Eq. (31), the first rii proposed components of 0'_^_ 
and rj^ are equal to the previous parameters associated with those rij features. Namely, 
9'^ 1^ = O^^k and rj'^ ^ ~ ''l+,k for all k £ {1, . . . , rii}. The difference between the proposed and 
previous parameters arises from the fact that 0'_i_ and rj^ contain an additional component 
O'^^ni+i s-nd 77^_„._,_i, respectively, drawn from the prior distributions on these parameter 
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spaces. Then, the acceptance ratio is given by 
r(/;„0V,,7VI/+„0+,r,+) 

piyf^T, I [f-.fU(^i:K-^^(^+^v^'\v'+)p{fU)pK)pW+) 



P(yi:k I [f^.J+.lO,,j,-^.e+,rj(:^),ri+)p{UMO+)p{v+) 

qfif+i I f+Me{0'+ I fU^f+i,0+)qr,{v+ I /+i,/+i,»7+) 



(52) 



Noting that each component of the parameter vector and f]^ is drawn i.i.d., and plugging 
in the appropriate definitions for the proposal distributions, we have 

_ Piyfk I [/-,/;j,^i:K;M^+.^''^<)Poi«son(n, + l;a/A^)nrLVp(^;,Jp(<,J 
PiVik I [-^-^ »7+)Poisson(n,; a/iV) UlUPi^+,k)piv+.k) 



(53) 



We use the notation qf{k <— j) to denote the proposal probability of transitioning from j 
to k unique features. Using the fact that = G Oi.K:^. and r]'_^ ^ = rj+.k G J?^*-* for all 
k € {1, . . . , n^}, we can simplify the acceptance ratio to: 

PiYik I [/-./;j,^i:K+,eV,„.+i,r/«,r7V_„^+i)Poisson(n, + l;a/7V)g/(n, ^ + 1) 

^ m • (° 

p(yi:T. I [/-i/+j],6'i:K+,r7('))Poisson(n,,;Q;/iV)q/(nj + 1 ^ rii) 

The derivation of the acceptance ratio for a death move follows similarly. 
C. Appendix C: Acceptance Ratio for Transition Parameters 

Since the proposal distributions for 7 and k use fixed variance ct^ or cr^, and mean equal to 
the current hyperparameter value, we have 

97(- I 7) = Gamma ( ^ ) qni' I «) = Gamma ( \, ) ■ (55) 

Let TT {tt^*^}. To update 7 given K, the acceptance probability is min{r(7' | 7),1} with 
acceptance ratio 

, , I , p(7r I 7^K,F)p(y I a^,6^)g(7 | 7^g^) 
'^^^ I ^' p{7T I 7, «, F)p(7 I a^, 6^)g(7' | 7, a^) ' ^ ^ 

Recalling the definition of ir^^^ from Eq. (14) and that Ki — J^kfik^ the likelihood term 
may be written as 

/(7) - Pi. 1 7, ^, = n ft ( , — ft "S""^'"^ ^ ■ (57) 

vill (nf=7'r(7))r(7 + «)/ii ^ 



exp{-(7'-7)m- (58) 
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The ratio of the prior distributions reduces to 
Letting ■& — and = the ratio of the proposal distributions reduces to 

(77<)'''„,^'-l 



q(7 I y,a?^) -^W^r -iexp{-7^} _ Ti^)^'-^-^ ^(^-^0 



,(7'|7,a^) (^^,.-i,,p{_^,_^} r(,9')7'"-'"- " ■ ^^^^ 
Our acceptance ratio can then be compactly written as 

ril' I 7) = exp{-(y - 7)^^^ (60) 

The Metropolis-Hastings sub-step for sampling k given 7 follows similarly. In this case, 
however, the likelihood terms simplifies to 

f(^)^[[ Yh + K)^^ i-i- " ccpi7v\j,K,F). (61) 

i ' 7 = 1 



D. Appendix D: BP-AR-HMM MCMC Algorithm 

The overall MCMC sampler for the BP-AR-HMM is outlined in Algorithm 1. Note that 
Algorithm 2 is embedded within Algorithm 1. 



References 

Aach, J. and G. Church (2001). Aligning gene expression time series with time warping 
algorithms. Bioinformatics 17(6), 495-508. 

Alon, J., S. Sclaroff, G. KoUios, and V. Pavlovic (2003). Discovering clusters in motion 
time-series data. In IEEE Conference on Computer Vision and Pattern Recognition, 
Volume 1, pp. 375-381. 

Altman, R. (2007). Mixed hidden Markov models. Journal of the American Statistical 
Association 102(477), 201-210. 

Aoki, M. and A. Havenner (1991). State space modeling of multiple time series. Econometric 
Reviews 10(1), 1-59. 

Barbie, J., A. Safonova, J.-Y. Pan, C. Faloutsos, J. K. Hodgins, and N. S. Pollard (2004). 
Segmenting motion capture data into distinct behaviors. In Proc. Graphics Interface, pp. 
185-194. 



Beal, M., Z. Ghahramani, and C. Rasmussen (2002). The infinite hidden Markov model. 
In Advances in Neural Information Processing Systems, Volume 14, pp. 577-584. 



Joint Modeling of IVIuitipie Reiated Time Series via the Beta Process 29 



Given a previous set of time-series-specific transition variables {77'*^}'^" the dynamic 
parameters {A^, Efe}("~^\ and features 

(a) Set {,7W} = {r7W }("-!), {A^, E^} = {A^, SjC""!), andF-F^"-!'. 

(b) From the feature matrix F, create count vector m — [ mi TO2 . . . raKj^ ], with 
TOfc representing the number of time series possessing feature k. 

(c) For each i € {1, . . . , N}, sample features as follows: 

(i) Set = m — f^, and reorder columns of F so that the K^^ columns with 
m^* > appear first. Appropriately relabel indices of {Ak,^k} and {t?'-'-'}. 

(ii) For each shared feature fc S {1, . . . , K^^}, set / = fit and: 

A. Consider fik G {0, 1} and: 



A. Create feature-constrained transition distributions: 

Trf (x[ ... r/^^ ] ® 

B. Compute hkelihood ^/.^. (yi*Ti) — P (y^T; I {^fci ^fc}) using 
variant of the sum-product algorithm described in Appendix A. 



B. Compute 



^1 (yl:^.) 



^ minjp* 1} / = 0; 

7V-m- fo(yW_) \ mm{l/pM}, / = 1. 

C. Sample /.^ ~ p{f \ mf,k, I) + (1 - p{I \ f))5{UJ)- 
(iii) Let /j' = /j and calculate the number of unique features Ui — — K^^. 

A. Propose a birth or death move, each with probability 0.5. 

• Birth: sample {0'-^ n +i-V+.ni+i} from their priors and set /'„._|_i = 1, 
n'i = ni + 1. 

• Death: sample £ ^ uniform[A'^* + 1 : AT+J and set f -^ ^ Q, n[ — rii — 1. 

B. Compute likelihoods if. (^y[%.^ and If |^y[!^/) of data under the previous 
and proposed models, respectively. 

C. Keep {( = 1) or discard (C = 0) proposed model by sampling: 



C Ber(p) p 



(yi:T.) Poisson(nj | f ^ ni) 



(iv) Set m = m ^ + fi- Remove columns for which = 0, and appropriately 
redefine the dynamic parameters {Ak, Efe} and transition variables {r)^^^}. 

(d) Resample dynamic parameters {Ak, Efe} and transition variables {ry^*-*} using the 
auxiliary variable sampler of Algorithm 2. 

(e) Fix {r,W}(») = {r,«}, {A,,Efc}(") = {Ak,^k}, and F^"' = F. 



Algorithm 1: BP-AR-HMM MCMC sampler. 
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Given the feature-restricted transition distributions tt''^ and dynamic parameters 
{Ak,T.k}, update the parameters as follows: 

(a) For each i e {1, . . . , N}: 

(i) Block sample zj^!^. as follows: 

A. For each fc e {1, . . . , A%}, initialize messages to m^^-^ rp{k) = 1. 

B. For each t G {Ti, . . . , 1} and k E {1, . . . , K^}, compute 

C. Working sequentially forward in time, and starting with transitions counts 
n« = 0: 

A. Sample a mode assignment z^^^ as: 



K 



E ^|)^ (fc)-^ (y«; A,y«, Efe) mfl,,{k)S a) 



B. Increment ^ (i, 



fc=i 



Note that Trj*' (fc) is zero for any k such that fik = 0, implying that z^*-* = fc will 
never be sampled (as desired). Considering all indices simply allows for 
efScient matrix implementation, 
(ii) For each {j, /c) € {1, . . . , K+} x {1, . . . , K+}, sample 

'^jfc I 7 ^ Gamma(l, 7 + K6{j, k) + n^^l). 
(b) For each k e {l,...,K+}: 

(i) Form Yk = {yf^zf^ = k) and Yu = {yf^kf' = fc} and compute sf), ' 
4y , and S^y^ as in Eq. (37). 

(ii) Sample dynamic parameters: 



Algorithm 2: BP-AR-HMM auxiliary variable sampler for updating transition and dy- 
namic parameters. 
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